ABSTRACT

Joint audio-visual speaker tracking requires that the locations
of microphones and cameras are known and given in a common
coordinate system. Sensor self-localization algorithms, however,
are usually developed separately for either the acoustic or the
visual modality and return their positions in a modality-specific
coordinate system, often with an unknown rotation, scaling and
translation between the two. In this paper we propose two
techniques to determine the positions of acoustic sensors in a
common coordinate system, based on audio-visual correlates, i.e.,
events that are localized separately by both microphones and
cameras. The first approach maps the output of an acoustic
self-calibration algorithm to the visual coordinate system by
estimating rotation, scale and translation, while the second
solves a joint system of equations with acoustic and visual
directions of arrival as input. The evaluation of the two
strategies reveals that joint calibration outperforms the mapping
approach and achieves an overall calibration error of 0.20 m even
in reverberant environments.

Index Terms — coordinate mapping, absolute geometry 
calibration 

1. INTRODUCTION 

Advanced teleconferencing systems, smart rooms or surveillance
and monitoring systems are example applications of distributed
audio-visual sensor networks. For many tasks, such as automatic
camera steering, events or objects of interest have to be
localized either acoustically, visually or jointly, which in turn
requires that the positions of the sensors are known. While the
sensor positions can be determined manually, it is more
convenient to do so automatically, in particular if they can
change over time, e.g., if a smartphone, which is part of the
network, is carried by a moving person.

Automatic geometry calibration of sensors is typically realized
by localizing and tracking an object and subsequently determining
the position of the sensors, such that the measurements of the
object's positions are most plausible.

Visual calibration algorithms work on features extracted from the
camera images. They can be divided into two categories [?]. The
first one tries to extract these features from easily
recognizable objects [?], whereas the second group extracts
features from an arbitrary scene to compare the field of view of
the individual cameras [?].

For acoustic sensor nodes, time of flight (ToF) based algorithms,
which employ special calibration hardware and signals to achieve
high positioning accuracies [?], have been proposed. However, a
tight clock synchronization between transmitter and receiver is
required, whereas [?] relaxed this limitation by estimating the
differences in the sampling phase and the sensor positions
jointly. If the calibration is based on time difference of
arrival (TDoA) measurements, loudspeaker and microphones need no
longer be synchronized, and a human speaker can be used as sound
source. However, a clock synchronization of the A/D converters of
the distributed microphones is still required. Even this
requirement becomes obsolete if direction of arrival (DoA) based
techniques are employed [?]. TDoA and DoA based calibration, if
carried out with artificial calibration signals with appropriate
correlation properties, will typically achieve higher accuracy
compared to speech signal based approaches [?]. Calibration based
on natural speech is preferable from a usability point of view,
as it can be carried out in the background, unnoticed by the
users of the audio-visual sensor network.

Most geometry calibration techniques are unable to report the
sensor positions in absolute coordinates. They return their
estimates in a modality-specific coordinate system, resulting in
an unknown rotation, translation and scaling between the
coordinate axes of the acoustic and visual sensor network. The
scaling ambiguity can be fixed if ToA or TDoA measurements are
employed [?]. If the calibration is based solely on DoA
measurements, the scale ambiguity remains, regardless of the
modality used [?].

If the sensor positions of one modality are known, the
displacement between the coordinate systems can be resolved by
exploiting audio-visual correlates, i.e., events or objects that
can be localized both acoustically and visually [?]. In this
paper we build upon this idea and present two strategies to
localize the acoustic sensors in a joint audio-visual coordinate
system. Both the acoustic and the visual localization are based
solely on DoA measurements, as they impose the least
synchronization requirements, as detailed above.

The first approach uses the existing acoustic sensor calibration
techniques from [?]. Based on the relative geometry estimates the
speaker trajectory can be recovered with the intersection based
approach from [?], while simultaneously the speaker trajectory is
estimated by the visual sensor network. By computing the optimal
mapping between the acoustic and visual trajectory we are able to
reveal rotation, translation and scale between both modalities.
The second approach exploits the fact that the sensors of both
modalities deliver DoA estimates. Thus, acoustic and visual
measurements can be cast in a single system of equations to
determine the acoustic sensor positions, while the known visual
sensor positions serve as anchor positions to eliminate the scale
ambiguity. A key component of both DoA based calibration methods
is the random sample consensus (RANSAC) outlier rejection
algorithm [?], which diminishes the impact of poor DoA estimates
on the localization performance. In case of the joint
calibration, this scheme will reject not only acoustic DoA
outliers but visual DoA outliers as well.

In the next section we introduce the first approach based on a
coordinate mapping, whereas Sec. 3 describes the joint
calibration approach. The performance of both algorithms is
evaluated in Sec. 4 before Sec. 5 concludes this paper.

2. COORDINATE MAPPING 

Our goal is the estimation of the coordinates of $I$ acoustic
sensors, where the coordinate system is defined by the known
positions of $K$ visual sensors. The location of the $k$-th
visual sensor node is described in 2D by the position vector
$\mathbf{c}_k$ and orientation $\gamma_k$. Now, consider a moving
speaker located at position $\mathbf{e}_t$ at time $t$, who is
seen by the visual sensors at DoAs $\theta_{k,t}$,
$k = 1, \ldots, K$. A position estimate $\hat{\mathbf{e}}_t$ is
obtained from the DoAs by the intersection based technique
presented in [?].

The acoustic DoA estimates $\varphi_{i,t}$, $i = 1, \ldots, I$,
$t = 1, \ldots, T$, captured from the same speaker trajectory are
used to determine estimates $\tilde{\mathbf{m}}_i$,
$i = 1, \ldots, I$, of the acoustic sensor positions and
estimates $\tilde{\Theta}_i$ of the orientations, using the
calibration algorithm from [?]. This algorithm can only provide a
relative geometry with an unknown scale factor. Therefore, only
relative speaker position estimates $\tilde{\mathbf{e}}_t$ are
obtained, using the same intersection based method as above.
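
The step from DoAs to a position estimate can be illustrated with a short sketch. It assumes a plain least-squares intersection of the bearing lines, which is not necessarily the cited intersection technique. The function name `intersect_doas` is hypothetical, and DoAs are taken to be measured relative to each sensor's orientation.

```python
import math

def intersect_doas(positions, orientations, doas):
    """Least-squares intersection of 2D bearing lines (illustrative assumption).

    positions:    list of (x, y) sensor positions
    orientations: list of sensor orientations in radians
    doas:         list of DoAs in radians, measured in each sensor's local frame
    Returns the point minimizing the summed squared distance to all bearing lines.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), gamma, theta in zip(positions, orientations, doas):
        ang = gamma + theta                       # bearing in the global frame
        nx, ny = -math.sin(ang), math.cos(ang)    # unit normal of the bearing line
        # Accumulate the normal equations: sum(n n^T) e = sum(n n^T p)
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        c = nx * px + ny * py
        b1 += nx * c
        b2 += ny * c
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("bearing lines are (nearly) parallel")
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)
```

With two sensors the result is the exact intersection of the two bearing lines; with more sensors and noisy DoAs it is a compromise between all lines.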

Since the acoustic event locations $\tilde{\mathbf{e}}_t$ are
described in a different coordinate system than the visual
estimates $\hat{\mathbf{e}}_t$, the following coordinate mapping
problem arises:

$$\hat{\mathbf{e}}_t = s \mathbf{R} \tilde{\mathbf{e}}_t + \mathbf{d}, \quad t = 1, \ldots, T, \tag{1}$$

where $s$ models the unknown scale factor and $\mathbf{R}$ and
$\mathbf{d}$ the rotation and translation between the acoustic
and the visual coordinate system. Mapping a set of points from
one coordinate system to another is known as Rigid Body
Transformation (RBT). In contrast to the widespread approach from
[?], which computes the RBT parameters (scale, rotation and
translation) via a Singular Value Decomposition (SVD), we suggest
a computation in the Discrete Fourier Transform (DFT) domain,
which turned out to be computationally more efficient.

Hence, we introduce a complex representation of the estimated
speaker positions as $u_t = \tilde{e}_{1,t} + j\tilde{e}_{2,t}$
and $v_t = \hat{e}_{1,t} + j\hat{e}_{2,t}$, where
$\tilde{\mathbf{e}}_t = (\tilde{e}_{1,t}, \tilde{e}_{2,t})$ and
$\hat{\mathbf{e}}_t = (\hat{e}_{1,t}, \hat{e}_{2,t})$ are the
two-dimensional speaker positions in the acoustic and visual
coordinate system, respectively. Thus, the mapping problem of
Eq. (1) is expressed as

$$v_t = \alpha u_t + \beta, \quad \alpha, \beta \in \mathbb{C}. \tag{2}$$

The absolute value and the phase of $\alpha$ correspond to scale
and orientation, while $\beta$ corresponds to the translation.
Arranging all observations into vectors
$\mathbf{v} = [v_1, \ldots, v_T]^T$ and
$\mathbf{u} = [u_1, \ldots, u_T]^T$, the least squares estimate
for the RBT parameters in the complex space is given by

$$(\alpha^*, \beta^*) = \operatorname*{argmin}_{\alpha, \beta} \; (\alpha \mathbf{u} + \beta \mathbf{1} - \mathbf{v})^H (\alpha \mathbf{u} + \beta \mathbf{1} - \mathbf{v}), \tag{3}$$

where $\mathbf{1}$ denotes a $T$-element vector of ones and
$(\cdot)^H$ the complex conjugate transpose of a vector.

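Let $\mathbf{x}$ and $\mathbf{y}$ denote the DFTs of $\mathbf{u}$
and $\mathbf{v}$. The optimization problem of Eq. (3) is
expressed in the DFT domain as

$$(\alpha^*, \beta^*) = \operatorname*{argmin}_{\alpha, \beta} \; (\alpha \mathbf{x} + \beta \mathbf{z} - \mathbf{y})^H (\alpha \mathbf{x} + \beta \mathbf{z} - \mathbf{y}), \tag{4}$$

where $\mathbf{z} = [1, 0, \ldots, 0]^T$ is a vector of length
$T$. Due to the orthogonality properties of the DFT the joint
optimization is decoupled into two separate optimizations:

$$\alpha^* = \operatorname*{argmin}_{\alpha} \; (\alpha \mathbf{x}_{2:T} - \mathbf{y}_{2:T})^H (\alpha \mathbf{x}_{2:T} - \mathbf{y}_{2:T}) \quad \text{and} \tag{5}$$

$$\beta^* = \operatorname*{argmin}_{\beta} \; (\alpha^* x_1 + \beta - y_1)^H (\alpha^* x_1 + \beta - y_1), \tag{6}$$

where the first bin of the DFTs is denoted by $(\cdot)_1$ and all
other bins by $(\cdot)_{2:T}$. Since Eq. (5) and Eq. (6) are
general least squares problems, the solution is found to be

$$\alpha^* = \frac{\mathbf{x}_{2:T}^H \mathbf{y}_{2:T}}{\mathbf{x}_{2:T}^H \mathbf{x}_{2:T}} \quad \text{and} \quad \beta^* = y_1 - \alpha^* x_1. \tag{7}$$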

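The RBT parameters can be retrieved as follows:

$$s = |\alpha^*|, \quad \mathbf{R} = \frac{1}{|\alpha^*|} \begin{bmatrix} \Re\{\alpha^*\} & -\Im\{\alpha^*\} \\ \Im\{\alpha^*\} & \Re\{\alpha^*\} \end{bmatrix} \quad \text{and} \quad \mathbf{d} = \begin{bmatrix} \Re\{\beta^*\} \\ \Im\{\beta^*\} \end{bmatrix}, \tag{8}$$

where $\Re$ and $\Im$ denote real and imaginary part,
respectively. If this transformation is applied to the relative
acoustic sensor position estimates $\tilde{\mathbf{m}}_i$
according to Eq. (1), the absolute acoustic sensor positions
$\hat{\mathbf{m}}_i$ in the visual coordinate system are
obtained.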
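
The closed-form solution of Eqs. (7) and (8) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. It assumes a DFT scaled by 1/T, so that the transform of the all-ones vector is [1, 0, ..., 0], and uses a naive O(T^2) DFT to stay dependency-free where a real implementation would use an FFT. The function names `dft` and `estimate_rbt_dft` are hypothetical.

```python
import cmath

def dft(seq):
    """Naive DFT scaled by 1/T, so the DFT of an all-ones vector is [1, 0, ..., 0]."""
    T = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * k * t / T) for t in range(T)) / T
            for k in range(T)]

def estimate_rbt_dft(acoustic, visual):
    """Estimate scale, rotation angle and translation mapping acoustic to visual points.

    acoustic, visual: equally long lists of (x, y) tuples with at least two
    distinct acoustic points.
    """
    u = [complex(px, py) for px, py in acoustic]   # complex positions, acoustic frame
    v = [complex(px, py) for px, py in visual]     # complex positions, visual frame
    x, y = dft(u), dft(v)
    # Eq. (7): alpha from all bins except the first, beta from the first bin
    num = sum(xk.conjugate() * yk for xk, yk in zip(x[1:], y[1:]))
    den = sum(abs(xk) ** 2 for xk in x[1:])
    alpha = num / den
    beta = y[0] - alpha * x[0]
    # Eq. (8): scale is |alpha|, rotation angle is arg(alpha), translation is beta
    return abs(alpha), cmath.phase(alpha), (beta.real, beta.imag)
```

The rotation is returned as an angle rather than a matrix; the matrix of Eq. (8) follows from its cosine and sine.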

To summarize, the calibration algorithm to recover the acoustic
sensor positions in the visual coordinate system consists of
three steps. First, run the relative acoustic calibration
algorithm and estimate the speaker trajectory. At the same time,
track the speaker in the visual domain. Second, compute the DFTs
of both trajectories, evaluate Eq. (7) and compute the RBT
parameters by Eq. (8). Finally, use the RBT parameters to
transform the acoustic sensor position estimates from the first
step into the visual coordinate system.

The DFT based RBT parameter estimation delivers the same results
as the conventional SVD based technique [?], but our FFT based
implementation is twice as fast as the SVD.


3. JOINT CALIBRATION 


Since both the acoustic and the visual sensors deliver DoA
estimates, we propose to extend the calibration algorithm, which
was used for the acoustic sensors only in step one of the
algorithm presented in the last section, to both modalities and
to calibrate the audio-visual network jointly. Due to the known
positions of the visual sensors, the scale ambiguity vanishes.

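In the local coordinate system of the $i$-th acoustic sensor a
DoA measurement can be modelled as a unit length vector

$$\mathbf{f}_{i,t} = [\cos(\varphi_{i,t}) \;\; \sin(\varphi_{i,t})]^T, \tag{9}$$

pointing from the sensor position to the event location. This
measurement vector will be compared with a prediction vector

$$\hat{\mathbf{f}}_{i,t} = \begin{bmatrix} \cos(\psi_{i,t} - \Theta_i) \\ \sin(\psi_{i,t} - \Theta_i) \end{bmatrix}, \tag{10}$$

where $\psi_{i,t} = \arg\{\mathbf{e}_t - \mathbf{m}_i\}$, see
Fig. 1. Following our previous publication [7], this prediction
can be formulated as a function of the geometry parameters as
follows:

$$\hat{\mathbf{f}}_{i,t} = \begin{bmatrix} \cos(\Theta_i) & \sin(\Theta_i) \\ -\sin(\Theta_i) & \cos(\Theta_i) \end{bmatrix} \frac{\mathbf{e}_t - \mathbf{m}_i}{\|\mathbf{e}_t - \mathbf{m}_i\|}. \tag{11}$$

By introducing the abbreviation

$$J = \sum_{i=1}^{I} \sum_{t=1}^{T} \|\mathbf{f}_{i,t}^T \hat{\mathbf{f}}_{i,t}\|^2 \tag{12}$$

and arranging the sensor positions, sensor orientations and
events into matrices $\mathbf{M} = [\mathbf{m}_1, \ldots, \mathbf{m}_I]$,
$\boldsymbol{\Theta} = [\Theta_1, \ldots, \Theta_I]$ and
$\mathbf{E} = [\mathbf{e}_1, \ldots, \mathbf{e}_T]$, respectively,
the geometry can be recovered by

$$(\mathbf{M}^*, \boldsymbol{\Theta}^*, \mathbf{E}^*) = \operatorname*{argmax}_{\mathbf{M}, \boldsymbol{\Theta}, \mathbf{E}} \{J\}. \tag{13}$$

The maximization problem of Eq. (13) can easily be transformed
into a root-finding problem, since $\mathbf{f}_{i,t}$ and
$\hat{\mathbf{f}}_{i,t}$ are unit length vectors. Subsequently,
the optimization is carried out by Newton’s method.

Fig. 1. Geometric relation between acoustic sensor and event
location.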
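
To make the objective concrete, the following sketch evaluates the predicted DoA vector of Eq. (11) and a cost of the form of Eq. (12). It assumes that the cost is the sum of squared inner products between measured and predicted unit vectors, and it leaves out the Newton optimization entirely. The function names are hypothetical.

```python
import math

def predict_doa_vector(e, m, theta):
    """Eq. (11): unit vector from sensor m to event e, rotated into the sensor frame."""
    dx, dy = e[0] - m[0], e[1] - m[1]
    norm = math.hypot(dx, dy)
    c, s = math.cos(theta), math.sin(theta)
    return ((c * dx + s * dy) / norm, (-s * dx + c * dy) / norm)

def acoustic_cost(M, Theta, E, phi):
    """Sum of squared inner products of measured and predicted DoA vectors.

    M:     list of (x, y) sensor positions
    Theta: list of sensor orientations in radians
    E:     list of (x, y) event positions
    phi:   phi[i][t] is the DoA (radians) measured by sensor i for event t
    """
    J = 0.0
    for i, (m, th) in enumerate(zip(M, Theta)):
        for t, e in enumerate(E):
            f = (math.cos(phi[i][t]), math.sin(phi[i][t]))   # measured vector, Eq. (9)
            fh = predict_doa_vector(e, m, th)                # predicted vector
            J += (f[0] * fh[0] + f[1] * fh[1]) ** 2
    return J
```

For noise-free measurements and the true geometry every inner product equals one, so the cost reaches its maximum of I times T.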

The formulations for the estimated and predicted DoA vectors hold
for the visual sensors, too. Thus we define

$$\mathbf{g}_{k,t} = \begin{bmatrix} \cos(\theta_{k,t}) \\ \sin(\theta_{k,t}) \end{bmatrix} \quad \text{and} \tag{14}$$

$$\hat{\mathbf{g}}_{k,t} = \begin{bmatrix} \cos(\gamma_k) & \sin(\gamma_k) \\ -\sin(\gamma_k) & \cos(\gamma_k) \end{bmatrix} \frac{\mathbf{e}_t - \mathbf{c}_k}{\|\mathbf{e}_t - \mathbf{c}_k\|}, \tag{15}$$

with the only difference that the visual sensor positions
$\mathbf{c}_k$ and the corresponding orientations $\gamma_k$ are
known. Hence, the visual DoA measurements form additional
constraints for the optimization of Eq. (13), and we incorporate
them to obtain a formulation which allows a joint audio-visual
calibration:

$$(\mathbf{M}^*, \boldsymbol{\Theta}^*, \mathbf{E}^*) = \operatorname*{argmax}_{\mathbf{M}, \boldsymbol{\Theta}, \mathbf{E}} \left\{ J + \sum_{k=1}^{K} \sum_{t=1}^{T} \|\mathbf{g}_{k,t}^T \hat{\mathbf{g}}_{k,t}\|^2 \right\}. \tag{16}$$

The optimization of Eq. (16) is again turned into a root-finding
problem in order to apply Newton’s method, where the visual
measurements provide the required constraints to obtain an
absolute sensor position estimate in the coordinate system
defined by the visual sensors.
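
Why the visual terms remove the ambiguity can be checked numerically. The sketch below assumes a cost built from squared inner products of measured and predicted DoA unit vectors and uses a made-up geometry; all names and numbers are hypothetical. Scaling, rotating and shifting the microphones and events together leaves the acoustic term unchanged, whereas the visual term drops because the cameras stay fixed.

```python
import math

def local_doa(e, p, orient):
    """DoA of event e seen from a sensor at p with the given orientation (radians)."""
    return math.atan2(e[1] - p[1], e[0] - p[0]) - orient

def doa_cost(sensors, orients, events, doas):
    """Sum of squared inner products between measured and predicted DoA unit vectors."""
    total = 0.0
    for p, o, row in zip(sensors, orients, doas):
        for e, meas in zip(events, row):
            # The inner product of two unit vectors is the cosine of their angle difference
            total += math.cos(meas - local_doa(e, p, o)) ** 2
    return total

def similarity(pt, s, ang, d):
    """Scale by s, rotate by ang, then translate by d."""
    c, sn = math.cos(ang), math.sin(ang)
    return (s * (c * pt[0] - sn * pt[1]) + d[0], s * (sn * pt[0] + c * pt[1]) + d[1])

# Hypothetical ground truth: two microphone arrays, two cameras, three events
mics, mic_or = [(1.0, 1.0), (5.0, 2.0)], [0.3, 2.0]
cams, cam_or = [(0.0, 0.0), (6.0, 0.0)], [0.8, 2.4]
events = [(2.0, 4.0), (3.0, 1.5), (4.5, 5.0)]
phi = [[local_doa(e, m, o) for e in events] for m, o in zip(mics, mic_or)]
theta = [[local_doa(e, c, o) for e in events] for c, o in zip(cams, cam_or)]

# Apply an arbitrary similarity transform to microphones and events
s, ang, d = 1.7, 0.4, (2.0, -1.0)
mics2 = [similarity(m, s, ang, d) for m in mics]
mic_or2 = [o + ang for o in mic_or]
events2 = [similarity(e, s, ang, d) for e in events]

J_true = doa_cost(mics, mic_or, events, phi)
J_moved = doa_cost(mics2, mic_or2, events2, phi)    # unchanged: acoustic term is ambiguous
V_true = doa_cost(cams, cam_or, events, theta)
V_moved = doa_cost(cams, cam_or, events2, theta)    # reduced: cameras act as anchors
```

The acoustic cost alone cannot distinguish the transformed geometry from the true one, which is the ambiguity that the mapping approach of Sec. 2 has to resolve afterwards.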

In the noise-free case, with perfect DoA measurements, the sensor
positions and orientations can be recovered perfectly, but
imperfect acoustic or visual DoA estimates caused by
reverberation or false detections can prevent a successful
optimization. Our earlier investigations presented in [7] showed
that this issue can successfully be addressed by RANSAC [?].
Since the application of RANSAC is straightforward, we highlight
only the relevant parts. The procedure can be summarized as
follows:

1. Randomly select the minimal number of observations
necessary to solve Eq. (16), e.g. $T >$ [?].

2. Determine sensor positions and orientations based on
the selected observations by solving Eq. (16).

3. Compute the intersection of all DoA axes for each
event. The hypothesized event location is the mean
of all intersections. A DoA measured by a sensor
becomes part of the candidate set C if the average
distance of all its intersection points to the hypothesized
event location is smaller than a threshold.

4. If the number of elements in C is larger than that of the
consensus set, estimate the sensor positions and orientations
based on C. It becomes the new consensus set if its error is
smaller than the error of the current consensus set.

5. If the number of elements in C is smaller than that of the
consensus set, choose a new initial set, or stop the algorithm
as soon as the maximum number of iterations is reached.













As a modification of this standard approach, we used the updated
consensus set of step 4 as the input for the second step.

4. SIMULATION RESULTS 

In order to evaluate the performance of both calibration
strategies we used the following simulation framework. We
simulated 3 random speaker trajectories, where the speaker stops
at approximately 140 positions for 5 seconds before moving on.
The sensors are located in a room of size 6.2 m × 7.2 m. Four
simulated cameras and four simulated five-element circular
microphone arrays (radius 5 cm) are located sufficiently far from
the walls, with the cameras oriented towards the center of the
room. The microphone signals are generated by the Image Method
[?] for reverberation times from 0 ms up to 500 ms. Acoustic DoA
estimates are obtained by correlating the filter impulse
responses of a filter-and-sum beamformer, which continuously
adapts to the moving source [?].

Rather than working on a true camera signal, visual DoA estimates
are simulated as follows. We employ Hidden Markov Models (HMMs)
to describe the errors in the DoA estimation. A limited field of
view of the camera is taken into account by dropping all angles
outside a window of ±30° relative to the camera orientation. This
effect is modelled by two separate HMMs. The first HMM is for the
case that a speaker is inside the visible region of the camera.
Here we distinguish the states 'detection', 'missed detection'
and 'false detection'. The second HMM models the case that no
speaker is inside the visible region. It incorporates the states
'false detection' and 'no detection'. The transition
probabilities of these models and the variance of the error
distribution have been learned by computing histograms of
oriented gradients (HOG) and applying a support vector machine
(SVM) to identify the head and shoulder region of the speaker on
the AV16.3 audio-visual corpus [20], using the annotated
sequences seq01-1p-0000 and seq15-1p-0100.
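
A simplified version of this simulation might look as follows. The sketch treats the two models as plain Markov chains over the named states. All transition probabilities and the error standard deviation are hypothetical placeholders, since the learned values are not given here; only the ±30° field of view is taken from the text.

```python
import math
import random

# Hypothetical transition probabilities (the learned values are not reported here)
IN_VIEW = {
    "detection":        {"detection": 0.90, "missed detection": 0.07, "false detection": 0.03},
    "missed detection": {"detection": 0.60, "missed detection": 0.35, "false detection": 0.05},
    "false detection":  {"detection": 0.70, "missed detection": 0.10, "false detection": 0.20},
}
OUT_OF_VIEW = {
    "no detection":    {"no detection": 0.95, "false detection": 0.05},
    "false detection": {"no detection": 0.80, "false detection": 0.20},
}
FOV = math.radians(30.0)     # half-width of the field of view, from the text
SIGMA = math.radians(1.0)    # hypothetical standard deviation of the DoA error

def step(chain, state, rng):
    """Draw the next state of a Markov chain given as nested probability dicts."""
    names = list(chain[state])
    return rng.choices(names, weights=[chain[state][n] for n in names])[0]

def simulate_visual_doas(true_doas, seed=0):
    """Map true DoAs (radians, relative to the camera axis) to simulated measurements.

    Returns one entry per frame: a float if a DoA was reported, otherwise None.
    """
    rng = random.Random(seed)
    s_in, s_out = "detection", "no detection"
    out = []
    for doa in true_doas:
        if abs(doa) <= FOV:                       # speaker inside the visible region
            s_in = step(IN_VIEW, s_in, rng)
            state = s_in
        else:                                     # speaker outside the visible region
            s_out = step(OUT_OF_VIEW, s_out, rng)
            state = s_out
        if state == "detection":
            out.append(doa + rng.gauss(0.0, SIGMA))
        elif state == "false detection":
            out.append(rng.uniform(-FOV, FOV))    # spurious angle inside the field of view
        else:
            out.append(None)                      # missed detection or no detection
    return out
```

False detections are drawn uniformly over the field of view here, which is another assumption of the sketch.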

In order to perform a fair comparison between the approaches
presented in Sec. 2 and Sec. 3, the estimation of the RBT
parameters is embedded into a RANSAC framework, too, since we
have shown in [?] that RANSAC can boost the performance of the
estimation of the RBT parameters.
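
How RANSAC can wrap the RBT estimation is sketched below for the complex model $v_t = \alpha u_t + \beta$. This is an illustration under assumptions and not the authors' procedure: the inlier threshold, the iteration count and the acceptance rule (a candidate set wins if it is larger, or equally large with a smaller error) are simplifications, and the function names are hypothetical.

```python
import random

def fit_similarity(u, v):
    """Least-squares fit of v = alpha * u + beta for complex point lists u and v."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu).conjugate() * (b - mv) for a, b in zip(u, v))
    den = sum(abs(a - mu) ** 2 for a in u)
    alpha = num / den
    return alpha, mv - alpha * mu

def ransac_similarity(u, v, threshold=0.1, iterations=200, seed=0):
    """RANSAC around fit_similarity; threshold and iterations are illustrative values."""
    rng = random.Random(seed)
    idx = list(range(len(u)))
    best, best_err, best_model = [], float("inf"), None
    for _ in range(iterations):
        sample = rng.sample(idx, 2)               # minimal set: two point pairs
        if u[sample[0]] == u[sample[1]]:
            continue                              # degenerate sample, draw again
        alpha, beta = fit_similarity([u[i] for i in sample], [v[i] for i in sample])
        # Candidate set: all pairs that the hypothesis explains within the threshold
        cand = [i for i in idx if abs(alpha * u[i] + beta - v[i]) < threshold]
        if len(cand) < len(best):
            continue
        # Refit on the candidate set and compare with the current consensus set
        alpha, beta = fit_similarity([u[i] for i in cand], [v[i] for i in cand])
        err = sum(abs(alpha * u[i] + beta - v[i]) ** 2 for i in cand) / len(cand)
        if len(cand) > len(best) or err < best_err:
            best, best_err, best_model = cand, err, (alpha, beta)
    return best_model, best
```

The returned index list identifies the trajectory points that were kept as inliers, which can help to inspect which position estimates were rejected.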

Since RANSAC is a random process, we average over multiple runs.
A sensor configuration is characterized by the positions and
orientations. Fig. 2 compares the mean positioning error (MPE) of
the coordinate mapping based calibration (RBT) and the joint
calibration strategy (Joint). It can be observed that the joint
calibration clearly outperforms the RBT approach, in particular
at low reverberation times. Evidently, it is advantageous to
avoid premature decisions on acoustic source and sensor positions
until the visual information is accounted for, as is done in the
joint calibration approach.

The coordinate mapping approach has limited capabilities to
determine a precise scale factor, and errors in the scale factor
dominate its performance. In order to isolate scale factor
estimation errors from orientation and translation errors we
performed an oracle experiment, where the scaling is assumed to
be known. Indeed, the performance is now similar to that of the
joint approach for low reverberation times and superior in a
highly reverberant environment. The sensor orientation error of
both approaches is approximately the same and smaller than 2° for
all reverberation times.

Fig. 2. Comparison of mean positioning error (MPE) over the room
reverberation time $T_{60}$ in ms for joint audio-visual
calibration (Joint), calibration by coordinate mapping (RBT) and
coordinate mapping with oracle information (RBT + oracle).

To achieve precise calibration results a suitable spatial event
configuration is more important than the total number of
available events. Thus, we selected 15 events with an appropriate
configuration from one exemplary trajectory and performed a joint
calibration. The results of Tab. 1 show that a performance
similar to that of the previous experiment, which used the
complete trajectory, is possible.

Table 1. Joint calibration using 15 events with appropriate
spatial configuration.

5. CONCLUSIONS 

We have described two different strategies to obtain an absolute
calibration of an acoustic sensor network if it is combined with
a visual sensor network whose sensor positions are known. By
using one of the two strategies, the scaling problem identified
in earlier publications [?] can be solved. The first approach,
which relies on the mapping of an acoustic to a visual speaker
trajectory, works with arbitrary acoustic calibration strategies
and is therefore very flexible. However, its performance is
limited by scale estimation errors. The second approach, which is
based on the solution of a system of nonlinear equations
employing acoustic and visual DoA measurements, is
computationally more complex. It outperformed the first approach
for all reverberation times and delivered a calibration error
smaller than 0.20 m and 2° even in reverberant environments.